This doc covers some applications of simple and complex models to our Alzheimer’s dataset, and discusses how the results could be applied or interpreted, much like the work originally done in the paper “Multiplexed immunoassay panel identifies novel CSF biomarkers for Alzheimer’s disease diagnosis and prognosis”.
Some of the topics covered: basic exploratory analysis, condensing the predictors with PCA, training a suite of caret models (with and without PCA preprocessing), and interpreting the results through variable importance and partial dependence.
Load in the same data as before and run some basic analysis to get back in the munging mood:
# Load in the whole Alzheimers data file
data <- read.csv('~/repos/portfolio/demonstrative/R/datasets/alzheimers/alzheimers.csv')
# Remove this field ... I forgot to do that when creating the dataset
data <- data %>% select(-male)
# Normalize gender labels
stopifnot(all(!is.na(data$gender)))
normalize.gender <- function(x) {
is.male <- x %>% tolower %>% str_detect('^m')
ifelse(is.male, 'Male', 'Female') %>% factor
}
data <- data %>% mutate(gender=normalize.gender(gender))
# Convert integer fields to numeric for the sake of consistency
data <- data %>% mutate_each_(funs(as.numeric), c('Betacellulin', 'Eotaxin_3'))
head(data)
# Parse our data into predictors and a binary response
# (Impaired vs Not Impaired) and check response frequency
X <- data %>% select(-response)
y <- data[,'response']
table(y)
y
   Impaired NotImpaired 
         91         242 
table(y) / length(y)
y
   Impaired NotImpaired 
  0.2732733   0.7267267 
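One thing worth noting from these frequencies: a model that always predicts “NotImpaired” would already be right about 73% of the time, which is part of why chance-corrected metrics like kappa are used to compare models later rather than raw accuracy. As a quick sanity check (nothing here beyond base R):

```r
# Majority-class baseline: accuracy of always predicting the most common class
max(table(y)) / length(y)  # ~0.727, i.e. 242 "NotImpaired" out of 333
```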
# Check out the names of the predictors we have:
names(X)
[1] "ACE_CD143_Angiotensin_Converti" "ACTH_Adrenocorticotropic_Hormon"
[3] "AXL" "Adiponectin"
[5] "Alpha_1_Antichymotrypsin" "Alpha_1_Antitrypsin"
[7] "Alpha_1_Microglobulin" "Alpha_2_Macroglobulin"
[9] "Angiopoietin_2_ANG_2" "Angiotensinogen"
[11] "Apolipoprotein_A_IV" "Apolipoprotein_A1"
[13] "Apolipoprotein_A2" "Apolipoprotein_B"
[15] "Apolipoprotein_CI" "Apolipoprotein_CIII"
[17] "Apolipoprotein_D" "Apolipoprotein_E"
[19] "Apolipoprotein_H" "B_Lymphocyte_Chemoattractant_BL"
[21] "BMP_6" "Beta_2_Microglobulin"
[23] "Betacellulin" "C_Reactive_Protein"
[25] "CD40" "CD5L"
[27] "Calbindin" "Calcitonin"
[29] "CgA" "Clusterin_Apo_J"
[31] "Complement_3" "Complement_Factor_H"
[33] "Connective_Tissue_Growth_Factor" "Cortisol"
[35] "Creatine_Kinase_MB" "Cystatin_C"
[37] "EGF_R" "EN_RAGE"
[39] "ENA_78" "Eotaxin_3"
[41] "FAS" "FSH_Follicle_Stimulation_Hormon"
[43] "Fas_Ligand" "Fatty_Acid_Binding_Protein"
[45] "Ferritin" "Fetuin_A"
[47] "Fibrinogen" "GRO_alpha"
[49] "Gamma_Interferon_induced_Monokin" "Glutathione_S_Transferase_alpha"
[51] "HB_EGF" "HCC_4"
[53] "Hepatocyte_Growth_Factor_HGF" "I_309"
[55] "ICAM_1" "IGF_BP_2"
[57] "IL_11" "IL_13"
[59] "IL_16" "IL_17E"
[61] "IL_1alpha" "IL_3"
[63] "IL_4" "IL_5"
[65] "IL_6" "IL_6_Receptor"
[67] "IL_7" "IL_8"
[69] "IP_10_Inducible_Protein_10" "IgA"
[71] "Insulin" "Kidney_Injury_Molecule_1_KIM_1"
[73] "LOX_1" "Leptin"
[75] "Lipoprotein_a" "MCP_1"
[77] "MCP_2" "MIF"
[79] "MIP_1alpha" "MIP_1beta"
[81] "MMP_2" "MMP_3"
[83] "MMP10" "MMP7"
[85] "Myoglobin" "NT_proBNP"
[87] "NrCAM" "Osteopontin"
[89] "PAI_1" "PAPP_A"
[91] "PLGF" "PYY"
[93] "Pancreatic_polypeptide" "Prolactin"
[95] "Prostatic_Acid_Phosphatase" "Protein_S"
[97] "Pulmonary_and_Activation_Regulat" "RANTES"
[99] "Resistin" "S100b"
[101] "SGOT" "SHBG"
[103] "SOD" "Serum_Amyloid_P"
[105] "Sortilin" "Stem_Cell_Factor"
[107] "TGF_alpha" "TIMP_1"
[109] "TNF_RII" "TRAIL_R3"
[111] "TTR_prealbumin" "Tamm_Horsfall_Protein_THP"
[113] "Thrombomodulin" "Thrombopoietin"
[115] "Thymus_Expressed_Chemokine_TECK" "Thyroid_Stimulating_Hormone"
[117] "Thyroxine_Binding_Globulin" "Tissue_Factor"
[119] "Transferrin" "Trefoil_Factor_3_TFF3"
[121] "VCAM_1" "VEGF"
[123] "Vitronectin" "von_Willebrand_Factor"
[125] "age" "tau"
[127] "p_tau" "Ab_42"
[129] "Genotype" "gender"
# Base R to accomplish the same:
# table(sapply(X, class))
# Look at what class each predictor has:
X %>% sapply(class) %>% table
.
 factor numeric 
      2     128 
# Pipeline to accomplish the same:
# X %>% sapply(class) %>% .[. == 'factor']
# Identify the factor variables, since they also need special attention:
names(X)[sapply(X, class) == 'factor']
[1] "Genotype" "gender"
We can start by looking at the relationship between impairment and some of the more intuitive variables like Age, Gender, and Genotype, to see if there are any obvious patterns.
data %>%
mutate(age_range=cut(age, breaks=5)) %>%
group_by(age_range, response) %>% tally %>%
plot_ly(x=~age_range, y=~n, color=~response, type='bar') %>%
plotly::layout(hovermode='closest', title='Age vs Impairment')
minimal value for n is 3, returning requested palette with 3 different levels
data %>%
group_by(Genotype, response) %>% tally %>%
plot_ly(x=~Genotype, y=~n, color=~response, type='bar') %>%
plotly::layout(hovermode='closest', title='Genotype vs Impairment')
minimal value for n is 3, returning requested palette with 3 different levels
data %>%
group_by(gender, response) %>% tally %>%
plot_ly(x=~gender, y=~n, color=~response, type='bar') %>%
layout(hovermode='closest', title='Gender vs Impairment')
minimal value for n is 3, returning requested palette with 3 different levels
Now what about the other ~125 variables? We can’t look at them all one at a time, so perhaps there is a way to “condense” them into something more manageable:
library(corrplot)
d.pca <- X %>% select(-gender, -Genotype)
cor.mat <- corrplot(cor(d.pca), order='hclust', tl.col='black', tl.cex=.5)
# Extract the variable names from the figure below since they will be useful
# for creating similar plots for comparison (and it's easier to compare in the same order)
cor.var <- rownames(cor.mat)
# Run PCA, and make sure scaling is on since we didn't do that manually
pca <- prcomp(d.pca, scale=T)
Check how well the Principal Components capture the correlated groups of variables:
data.frame(Cumulative.Variance=summary(pca)$importance[3,]) %>% mutate(PC.Number=1:n()) %>%
plot_ly(x=~PC.Number, y=~Cumulative.Variance, mode='lines', type='scatter') %>%
layout(title='Cumulative Variance Explained by Principal Components')
One way to see how much each PC relates to a feature is to look at the correlation between the feature itself and its transformed version in the new principal component space. A simpler way to put that: we have some “transformation” (which could be from PCA or anything else) that assigns a new value to every feature for each observation. We can then just look at how the original values correlate with the transformed ones to see what the transformation is actually doing:
original.data <- d.pca[,cor.var]
pca.data <- pca$x[,1:25] # Using the first 25 PCs only, for brevity
corrplot(cor(original.data, pca.data), tl.col='black', tl.cex=.5)
First we have to do a little extra massaging of the predictor data to make sure that all the different model types within caret will be able to use it (namely, encoding factors in some numeric way).
# We currently have factors like this:
X %>% select(starts_with('gender'), starts_with('Genotype')) %>% head
We don’t want these to remain factors or a lot of different caret models will choke. Luckily there’s a caret function that makes that pretty easy:
# Here, the dummyVars function will automatically detect and turn factor values into separate columns
X.m <- predict(dummyVars(~., X), X) %>% as.data.frame
X.m %>% select(starts_with('gender'), starts_with('Genotype')) %>% head
These can be ignored for now but are good to come back to for reference on specific configurations and such:
# These functions are mostly just precursors to the following training commands, and
# serve to do things like specify cross validation, how performance measures are
# calculated, create ensemble models, apply PCA to data subsets, etc.
#
# This isn't a caret thing; it's just a helper of my own that makes
# it more convenient to save model results to disk
tr <- proj$getTrainer()
#
# This function will define how performance measures are calculated for each "fold"
summary.function <- function(...){
arg.list <- list(...)
if (!('Impaired' %in% names(arg.list[[1]])))
arg.list[[1]]$Impaired <- arg.list[[1]]$classProb
c(do.call(twoClassSummary, arg.list), do.call(defaultSummary, arg.list))
}
#
# Training control -- standard caret stuff
tc <- trainControl(
classProbs=T, method='cv', number=10,
summaryFunction = summary.function,
verboseIter=F, savePredictions='final', returnResamp='final', allowParallel=T
)
#
# This is an ensemble specification that will average predictions from
# multiple models together
get.ensemble.model <- function(tuneList){
caret.list.args <- list(
trControl=trainControl(
method='cv', number=5, classProbs=T,
summaryFunction = function(...) c(twoClassSummary(...), defaultSummary(...)),
returnData=F, savePredictions='final',
allowParallel=T, verboseIter=F
),
tuneList=tuneList
)
caret.stack.args <- list(
method=GetEnsembleAveragingModel(),
trControl=trainControl(method='none', savePredictions = 'final', classProbs=T,
summaryFunction = function(...) c(twoClassSummary(...), defaultSummary(...)),
returnResamp='final')
)
GetCaretEnsembleModel(caret.list.args, caret.stack.args)
}
#
# Create a vector containing the names of features we do NOT want to preprocess with PCA
non.pca.vars <- X.m %>% select(starts_with('Gender'), starts_with('Genotype'), age) %>% names
#
# Two utility functions for splitting predictor data into separate data frames as well
# as merging them back together before fitting models with the result
pca.split <- function(X) {
list(
var=X %>% dplyr::select(one_of(non.pca.vars)),
pca=X %>% dplyr::select(-one_of(non.pca.vars))
)
}
pca.combine <- function(pp, X) {
X <- pca.split(X)
cbind(X$var, predict(pp, newdata=X$pca))
}
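As a quick sanity check on these helpers, the two pieces should exactly partition the columns of the dummy-encoded frame (illustrative only, using the X.m frame from above):

```r
# pca.split should send the protein columns to $pca and everything
# in non.pca.vars to $var, with nothing lost or duplicated
parts <- pca.split(X.m)
ncol(parts$var) + ncol(parts$pca) == ncol(X.m)  # expect TRUE
```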
In order to use PCA within the caret modeling framework, you have 3 main options:
1. Apply PCA to the full dataset up front, before resampling (simple, but the rotation is then computed with knowledge of every holdout fold).
2. Use caret's built-in preProcess='pca' option in train (easy, but it applies to all numeric predictors rather than just a chosen subset).
3. Override the fit/predict/prob functions of a caret model so that PCA is applied to just the desired features within each resampling iteration.
Option 3 is likely the best choice, and caret makes it easy to override any part of its built-in models (though understanding the consequences of overriding things is sometimes tough). Here is an example:
# Example Caret custom model specification
# PCA preprocessing model wrapper
#
# This function will take any valid caret "method" name
# and wrap its fit/predict/prob functions with necessary logic to
# apply PCA to data subsets
get.model <- function(model.name){
# All Caret models are represented as lists, retrievable using "getModelInfo"
# Each of these lists has a variety of functions keyed by names like:
# "fit" - This function takes in data to fit the model on and should return a fit model object
# "prob" - This function in the list should return predicted probabilities
# "grid" - This function is responsible for creating hyperparameter grids to train over
# And several more ...
model <- getModelInfo()[[model.name]]
m <- model
# In this case, we're overriding the standard "fit" function with a new function
# that will split the input data, apply PCA to some of it, and then fit a model
# on the resulting, smaller dataset
m$fit <- function(x, y, wts, param, lev, last, classProbs, ...){
X <- pca.split(x)
pp <- preProcess(X$pca, method=c('center', 'scale', 'pca'), pcaComp = 40)
X.train <- cbind(X$var, predict(pp, newdata=X$pca))
modelFit <- model$fit(X.train, y, wts, param, lev, last, classProbs, ...)
modelFit$pp <- pp
modelFit$feature.names <- names(X.train)
modelFit
}
# The "predict" and "prob" methods must be overridden as well to apply PCA
# to any new data given to make predictions on
m$predict <- function (modelFit, newdata, submodels = NULL) {
X.test <- pca.combine(modelFit$pp, newdata)
model$predict(modelFit, X.test, submodels)
}
m$prob <- function (modelFit, newdata, submodels = NULL){
X.test <- pca.combine(modelFit$pp, newdata)
model$prob(modelFit, X.test, submodels)
}
# And as is par for the course with Caret, there are always a couple
# of model types that don't work as expected when you do custom
# things like this. In this case, xgbTree models were using the original
# feature names to get "importances" of each feature, so that must be
# overridden here to use the reduced feature list instead
if (model.name == 'xgbTree'){
m$varImp = function(object, numTrees = NULL, ...) {
imp <- xgb.importance(object$feature.names, model = object)
imp <- as.data.frame(imp)[, 1:2]
rownames(imp) <- as.character(imp[,1])
imp <- imp[,2,drop = FALSE]
colnames(imp) <- "Overall"
imp
}
}
m
}
With all the precursor pieces in place, the caret models can be trained with different tuning settings; in this case, the wrapper function above will also make sure PCA is applied to the protein features:
pre.proc <- c('center', 'scale')
models <- list(
tr$getModel('pca_glm', method=get.model('glm'), trControl=tc),
tr$getModel('pca_glmnet', method=get.model('glmnet'), trControl=tc, tuneLength=5),
tr$getModel('pca_rf', method=get.model('rf'), trControl=tc, tuneGrid=expand.grid(.mtry = c(2,4,8))),
tr$getModel('pca_rpart', method=get.model('rpart'), tuneLength=10, trControl=tc),
tr$getModel('pca_gbm', method=get.model('gbm'), tuneLength=5, trControl=tc, verbose=F),
tr$getModel('pca_xgb', method=get.model('xgbTree'), tuneLength=5, trControl=tc),
tr$getModel('pca_spline', method=get.model('earth'), preProcess=pre.proc, trControl=tc, tuneLength=5),
tr$getModel('pca_nnet', method=get.model('nnet'), preProcess=pre.proc, trControl=tc, tuneLength=5, trace=F)
)
names(models) <- sapply(models, function(m) m$name)
# Loop through the models and call "train" on each
pca.results <- lapply(models, function(m) tr$train(m, X.m, y, enable.cache=T)) %>% setNames(names(models))
2016-11-10 08:10:09 INFO::Beginning training for model "pca_glm" (cache name = "model_pca_glm")
2016-11-10 08:10:09 DEBUG::Restoring cached object "~/data/meetups/r_tutorials/tutorial_03/cache/models/model_pca_glm.Rdata" from disk
2016-11-10 08:10:09 INFO::Training complete for model "pca_glm"
2016-11-10 08:10:09 INFO::Beginning training for model "pca_glmnet" (cache name = "model_pca_glmnet")
2016-11-10 08:10:09 DEBUG::Restoring cached object "~/data/meetups/r_tutorials/tutorial_03/cache/models/model_pca_glmnet.Rdata" from disk
2016-11-10 08:10:09 INFO::Training complete for model "pca_glmnet"
2016-11-10 08:10:09 INFO::Beginning training for model "pca_rf" (cache name = "model_pca_rf")
2016-11-10 08:10:09 DEBUG::Restoring cached object "~/data/meetups/r_tutorials/tutorial_03/cache/models/model_pca_rf.Rdata" from disk
2016-11-10 08:10:09 INFO::Training complete for model "pca_rf"
2016-11-10 08:10:09 INFO::Beginning training for model "pca_rpart" (cache name = "model_pca_rpart")
2016-11-10 08:10:09 DEBUG::Restoring cached object "~/data/meetups/r_tutorials/tutorial_03/cache/models/model_pca_rpart.Rdata" from disk
2016-11-10 08:10:09 INFO::Training complete for model "pca_rpart"
2016-11-10 08:10:09 INFO::Beginning training for model "pca_gbm" (cache name = "model_pca_gbm")
2016-11-10 08:10:09 DEBUG::Restoring cached object "~/data/meetups/r_tutorials/tutorial_03/cache/models/model_pca_gbm.Rdata" from disk
2016-11-10 08:10:09 INFO::Training complete for model "pca_gbm"
2016-11-10 08:10:09 INFO::Beginning training for model "pca_xgb" (cache name = "model_pca_xgb")
2016-11-10 08:10:09 DEBUG::Restoring cached object "~/data/meetups/r_tutorials/tutorial_03/cache/models/model_pca_xgb.Rdata" from disk
2016-11-10 08:10:09 INFO::Training complete for model "pca_xgb"
2016-11-10 08:10:09 INFO::Beginning training for model "pca_spline" (cache name = "model_pca_spline")
2016-11-10 08:10:09 DEBUG::Restoring cached object "~/data/meetups/r_tutorials/tutorial_03/cache/models/model_pca_spline.Rdata" from disk
2016-11-10 08:10:09 INFO::Training complete for model "pca_spline"
2016-11-10 08:10:09 INFO::Beginning training for model "pca_nnet" (cache name = "model_pca_nnet")
2016-11-10 08:10:09 DEBUG::Restoring cached object "~/data/meetups/r_tutorials/tutorial_03/cache/models/model_pca_nnet.Rdata" from disk
2016-11-10 08:10:09 INFO::Training complete for model "pca_nnet"
For comparison, also train a bunch of models on the entire dataset:
library(caretEnsemble)
Attaching package: ‘caretEnsemble’
The following object is masked from ‘package:ggplot2’:
autoplot
pre.proc <- c('center', 'scale')
# This "ensemble" model will combine predictions from 3 separate other models
ens.model <- list(
glmnet=caretModelSpec(method='glmnet', preProcess=pre.proc, tuneLength=5),
gbm=caretModelSpec(method='gbm', tuneLength=5, verbose=F),
spline=caretModelSpec(method='earth', tuneLength=5, preProcess=pre.proc)
)
models <- list(
tr$getModel('glm', method='glm', preProcess=pre.proc, trControl=tc),
tr$getModel('glmnet', method='glmnet', preProcess=pre.proc, trControl=tc, tuneLength=5),
tr$getModel('nnet', method='nnet', preProcess=pre.proc, trControl=tc, tuneLength=5, trace=F),
tr$getModel('rpart', method='rpart', tuneLength=10, trControl=tc),
tr$getModel('gbm', method='gbm', tuneLength=5, trControl=tc, verbose=F),
tr$getModel('xgb', method='xgbTree', tuneLength=5, trControl=tc),
tr$getModel('spline', method='earth', preProcess=pre.proc, trControl=tc, tuneLength=5),
tr$getModel('rf', method='rf', trControl=tc, tuneLength=5),
tr$getModel('ensemble', method=get.ensemble.model(ens.model), trControl=tc)
)
names(models) <- sapply(models, function(m) m$name)
# Again, loop through the models and run the training process
all.results <- lapply(models, function(m) tr$train(m, X.m, y, enable.cache=T)) %>% setNames(names(models))
2016-11-10 08:10:09 INFO::Beginning training for model "glm" (cache name = "model_glm")
2016-11-10 08:10:09 DEBUG::Restoring cached object "~/data/meetups/r_tutorials/tutorial_03/cache/models/model_glm.Rdata" from disk
2016-11-10 08:10:09 INFO::Training complete for model "glm"
2016-11-10 08:10:09 INFO::Beginning training for model "glmnet" (cache name = "model_glmnet")
2016-11-10 08:10:09 DEBUG::Restoring cached object "~/data/meetups/r_tutorials/tutorial_03/cache/models/model_glmnet.Rdata" from disk
2016-11-10 08:10:09 INFO::Training complete for model "glmnet"
2016-11-10 08:10:09 INFO::Beginning training for model "nnet" (cache name = "model_nnet")
2016-11-10 08:10:09 DEBUG::Restoring cached object "~/data/meetups/r_tutorials/tutorial_03/cache/models/model_nnet.Rdata" from disk
2016-11-10 08:10:09 INFO::Training complete for model "nnet"
2016-11-10 08:10:09 INFO::Beginning training for model "rpart" (cache name = "model_rpart")
2016-11-10 08:10:09 DEBUG::Restoring cached object "~/data/meetups/r_tutorials/tutorial_03/cache/models/model_rpart.Rdata" from disk
2016-11-10 08:10:10 INFO::Training complete for model "rpart"
2016-11-10 08:10:10 INFO::Beginning training for model "gbm" (cache name = "model_gbm")
2016-11-10 08:10:10 DEBUG::Restoring cached object "~/data/meetups/r_tutorials/tutorial_03/cache/models/model_gbm.Rdata" from disk
2016-11-10 08:10:10 INFO::Training complete for model "gbm"
2016-11-10 08:10:10 INFO::Beginning training for model "xgb" (cache name = "model_xgb")
2016-11-10 08:10:10 DEBUG::Restoring cached object "~/data/meetups/r_tutorials/tutorial_03/cache/models/model_xgb.Rdata" from disk
2016-11-10 08:10:10 INFO::Training complete for model "xgb"
2016-11-10 08:10:10 INFO::Beginning training for model "spline" (cache name = "model_spline")
2016-11-10 08:10:10 DEBUG::Restoring cached object "~/data/meetups/r_tutorials/tutorial_03/cache/models/model_spline.Rdata" from disk
2016-11-10 08:10:10 INFO::Training complete for model "spline"
2016-11-10 08:10:10 INFO::Beginning training for model "rf" (cache name = "model_rf")
2016-11-10 08:10:10 DEBUG::Restoring cached object "~/data/meetups/r_tutorials/tutorial_03/cache/models/model_rf.Rdata" from disk
2016-11-10 08:10:10 INFO::Training complete for model "rf"
2016-11-10 08:10:10 INFO::Beginning training for model "ensemble" (cache name = "model_ensemble")
2016-11-10 08:10:10 DEBUG::Restoring cached object "~/data/meetups/r_tutorials/tutorial_03/cache/models/model_ensemble.Rdata" from disk
2016-11-10 08:10:10 INFO::Training complete for model "ensemble"
Caret attaches a ton of information to each fit model object:
# The "model" object itself
pca.results[['pca_glm']]$fit$finalModel
Call: NULL
Coefficients:
(Intercept) gender.Female gender.Male Genotype.E2E2 Genotype.E2E3 Genotype.E2E4 Genotype.E3E3
-6.7028867 1.2825389 NA 12.6556281 1.3260370 3.1554931 1.1988066
Genotype.E3E4 Genotype.E4E4 age PC1 PC2 PC3 PC4
1.8173419 NA 0.0848972 0.1574130 -0.0532327 1.0802509 -0.0243411
PC5 PC6 PC7 PC8 PC9 PC10 PC11
-0.4305046 -0.3790200 0.3341016 -0.8673593 0.1649761 0.4377114 0.4745032
PC12 PC13 PC14 PC15 PC16 PC17 PC18
0.5366593 0.2424596 -0.0186294 -0.0318212 -0.0480346 0.3106117 -0.2465031
PC19 PC20 PC21 PC22 PC23 PC24 PC25
0.2596782 -0.5054591 -0.4318682 -0.1894219 0.4396659 -0.0002921 0.4673352
PC26 PC27 PC28 PC29 PC30 PC31 PC32
0.5351138 -0.4806222 -0.1748346 0.2037984 -0.4253246 0.3143912 0.1927480
PC33 PC34 PC35 PC36 PC37 PC38 PC39
-0.0874630 -0.1230627 -0.4731087 -0.1464317 -0.0408122 0.7974912 -0.0068709
PC40
0.2124745
Degrees of Freedom: 332 Total (i.e. Null); 285 Residual
Null Deviance: 390.6
Residual Deviance: 155.9 AIC: 251.9
Note the NA coefficients above (gender.Male and Genotype.E4E4): dummyVars(~., ...) produces a full set of indicator columns for each factor, so one level of each is linearly dependent on the others plus the intercept, and glm drops it rather than failing.
# Resampling results for every hyperparameter combination
all.results[['xgb']]$fit$results %>% head
rbind(GetResampleData(pca.results), GetResampleData(all.results)) %>%
mutate(model=reorder(model, kappa, median)) %>%
plot_ly(x=~model, y=~kappa, type='box') %>%
layout(margin=list(b=100), title='Model Performance')
rbind(GetResampleData(pca.results), GetResampleData(all.results)) %>% head
To check whether the ensemble’s edge is more than resampling noise, we can run a paired t-test comparing the ensemble’s per-fold kappa to that of each other model:
metric <- 'kappa'
dt <- rbind(GetResampleData(pca.results), GetResampleData(all.results))
dt.ens <- dt %>% filter(model == 'ensemble') %>% arrange(resample) %>% .[,metric]
dt %>% group_by(model) %>% do({
d <- .
d <- d %>% arrange(resample)
# print(d)
if (d$model[1] == 'ensemble'){
data.frame(p=1)
} else {
r <- t.test(dt.ens, data.frame(d)[,metric], paired=T, alternative='greater')
data.frame(p=r$p.value)
}
}) %>% ungroup %>% arrange(p) %>%
mutate(model=reorder(model, p, max)) %>%
plot_ly(x=~model, y=~p, type='scatter', mode='lines') %>%
layout(margin=list(b=100,r=100), title='T-Test Statistics')
var.imp <- GetVarImp(all.results)
Loading required package: nnet
Loading required package: gbm
Loading required package: survival
Attaching package: ‘survival’
The following object is masked from ‘package:caret’:
cluster
Loading required package: splines
Loaded gbm 2.1.1
Loading required package: earth
Loading required package: plotmo
Loading required package: plotrix
Loading required package: TeachingDemos
Attaching package: ‘TeachingDemos’
The following object is masked from ‘package:plotly’:
subplot
var.imp %>%
mutate(feature=reorder(feature, score, mean)) %>%
plot_ly(x=~feature, y=~score, color=~model, mode='markers', type='scatter') %>%
layout(margin=list(b=200), title='Feature Importance (no PCA Features)')
var.model.names <- names(pca.results)[!str_detect(names(pca.results), 'nnet')]
var.imp <- GetVarImp(pca.results[var.model.names])
var.imp %>%
mutate(feature=reorder(feature, score, mean)) %>%
plot_ly(x=~feature, y=~score, color=~model, mode='markers', type='scatter') %>%
layout(margin=list(b=200), title='Feature Importance (w/ PCA Features)')
pd.vars <- c(
'age', 'gender.Male', 'gender.Female', 'tau',
'Cystatin_C', 'Ab_42', 'Cortisol', 'VEGF',
'NT_proBNP'
)
pd.models <- c('xgb', 'gbm', 'glmnet', 'spline', 'ensemble')
pred.fun <- function(object, newdata) {
if ('caretStack' %in% class(object$finalModel)){
pred <- predict(object$finalModel, newdata=newdata, type='prob')
} else {
pred <- predict(object, newdata=newdata, type='prob')
}
if (is.vector(pred)) pred
else pred[,1]
}
options(error=recover)
registerCores(1) # Increase this to make PD calcs faster
pd.data <- GetPartialDependence(
all.results[pd.models], pd.vars, pred.fun,
X=X.m, # This can come from model objects but only if returnData=T in trainControl
grid.size=50, grid.window=c(0, 1), # Resize these to better fit range of data
sample.rate=1, # Decrease this if PD calculations take too long
verbose=F, seed=SEED
)
First, to help understand where partial dependence comes from, it’s helpful to look at the “ICE” (Individual Conditional Expectation) curves. This is a much simpler process than the name makes it sound: computing these involves nothing more than picking an observation, altering the value of one predictor for that observation, and seeing how the predicted probability from a model changes as that predictor changes:
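Conceptually (setting aside the GetPartialDependence helper used below), a single ICE curve could be sketched like this; `ice.curve` and its arguments are illustrative, assuming any fitted caret model and the X.m frame from above:

```r
# Hypothetical sketch of one ICE curve: replicate one observation across a
# grid of values for one predictor and record the predicted probability
ice.curve <- function(fit, X, obs.index, predictor, grid.size=50) {
  grid <- seq(min(X[[predictor]]), max(X[[predictor]]), length.out=grid.size)
  obs <- X[rep(obs.index, grid.size), ]  # same observation repeated
  obs[[predictor]] <- grid               # overwrite just the one predictor
  data.frame(x=grid, y=predict(fit, newdata=obs, type='prob')[, 'Impaired'])
}
# e.g. ice.curve(all.results[['gbm']]$fit, X.m, obs.index=1, predictor='tau')
```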
pd.all <- foreach(pd=pd.data, .combine=rbind)%do%
{ pd$pd %>% dplyr::mutate(predictor=pd$predictor, model=pd$model) }
# Plot the ICE curves for every observation in our dataset (each line represents one observation)
pd.all %>%
mutate(i=as.numeric(i)) %>%
ggplot(aes(x=x, y=y, group=i)) +
geom_line(show.legend=F, alpha=.1) +
theme_bw() +
facet_wrap(predictor~model, scales='free', ncol = length(pd.models)) +
ggtitle('Individual Conditional Expectation Curves (i.e. "Disaggregated" Partial Dependence)')
To simplify this, we can take the average predicted value across all the observations (i.e. across the individual lines), giving a single average prediction for each value of the feature. This is exactly what partial dependence is:
pd.hist.model <- pd.data[[1]]$model
pd.hist <- lapply(pd.data, function(pd){
if (pd$model != pd.hist.model) return(NULL)
else data.frame(pd$x) %>% setNames(pd$predictor)
}) %>% .[!sapply(., is.null)] %>% do.call('cbind', .) %>%
melt(id.vars=NULL, variable.name='predictor')
pd.mean <- foreach(pd=pd.data, .combine=rbind)%do%
{ pd$pd %>% dplyr::mutate(predictor=pd$predictor, model=pd$model) } %>%
dplyr::group_by(predictor, model, x) %>%
dplyr::summarise(y.mid=mean(y)) %>% ungroup
ggplot(NULL) +
geom_rug(aes(x=value), size=2, alpha=.1, data=pd.hist) +
geom_line(aes(x=x, y=y.mid, color=model), data=pd.mean) +
facet_wrap(~predictor, scale='free') +
theme_bw() +
ylab('Predicted Probability') + xlab('Predictor Value') +
ggtitle('Partial Dependence by Model (free Y scale)')
The y-scales above vary widely, so fixing them to be the same gives a much clearer picture of what features are affecting predictions the most:
ggplot(NULL) +
geom_rug(aes(x=value), size=2, alpha=.1, data=pd.hist) +
geom_line(aes(x=x, y=y.mid, color=model), data=pd.mean) +
facet_wrap(~predictor, scale='free_x') +
theme_bw() +
ylab('Predicted Probability') + xlab('Predictor Value') +
ggtitle('Partial Dependence by Model (fixed Y scale)')
glmnet.model <- all.results$glmnet$fit$finalModel
glmnet.coef <- predict(glmnet.model, s=all.results$glmnet$fit$bestTune$lambda, type='coefficients')
glmnet.coef <- glmnet.coef[,1]
glmnet.coef %>% data.frame %>% setNames('Coefficient') %>% add_rownames(var='Feature') %>%
filter(abs(Coefficient) > 0) %>%
mutate(Feature=reorder(Feature, Coefficient, mean)) %>%
plot_ly(x=~Feature, y=~Coefficient, type='bar') %>%
layout(margin=list(b=200), title='Glmnet Coefficients')
What we’ve done here comes close to supporting conclusions similar to those in the paper that first used this data (mentioned in the intro). But taking many of these same approaches into the world of business data leaves a lot to be desired, in my opinion. ML models certainly surface useful signals for making predictive conclusions, but the way they do so without any prior knowledge of the problem or the inputs really limits how realistic they can ever be (e.g. the step functions or unbounded relationships they learn are never really true).
A good middle ground is likely frameworks for building models, rather than applying predefined ones (e.g. caret); hopefully we can make such things a topic for another meetup. These kinds of frameworks make it possible to build models that generalize well but also learn realistic relationships, at least insofar as those relationships could ever be pre-determined.